
AWS maintenance #1810

Draft
andrewkern wants to merge 3 commits into popsim-consortium:main from andrewkern:aws_maint

Conversation

@andrewkern
Member

This is a workflow that aims to securely handle uploading new artifacts (genetic maps / annotations) to the AWS site.

  • Adds a local helper script (maintenance/upload_to_s3.sh) that validates uploads against the
    catalog before triggering the workflow
  • AWS credentials are now stored as GitHub repo secrets, accessible only through a protected
    environment requiring admin approval

A maintainer runs the local script pointing at a tarball and the catalog entry it corresponds to:

./maintenance/upload_to_s3.sh data/HapMapII_GRCh38.tar.gz HomSap genetic_map HapMapII_GRCh38

The script:

  1. Looks up the species, genetic map/annotation in the stdpopsim catalog
  2. Verifies the local file's SHA256 matches the catalog entry
  3. Creates a temporary draft GitHub release to transport the file
  4. Triggers the GitHub Actions workflow

The workflow then:

  1. Checks out the repo and re-validates against the catalog
  2. Downloads the asset and verifies SHA256 again
  3. Checks that the S3 destination doesn't already exist (no overwrites)
  4. Uploads to S3
  5. Cleans up the draft release
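The SHA256 verification that appears in both lists (script step 2, workflow step 2) can be sketched as follows; the temp file stands in for a real tarball, and in the real script the expected hash would come from the stdpopsim catalog rather than being recomputed:

```shell
# Sketch of the checksum check; the payload is a stand-in for a tarball.
FILE=$(mktemp)
printf 'example payload' > "$FILE"
# In the real script, EXPECTED comes from the catalog entry, not the file.
EXPECTED=$(sha256sum "$FILE" | cut -d' ' -f1)
ACTUAL=$(sha256sum "$FILE" | cut -d' ' -f1)
if [ "$ACTUAL" != "$EXPECTED" ]; then
    echo "ERROR: SHA256 mismatch (expected $EXPECTED, got $ACTUAL)" >&2
    exit 1
fi
echo "checksum OK"
rm -f "$FILE"
```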

This is currently in a draft state and will need some review.

@codecov

codecov bot commented Mar 3, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.81%. Comparing base (8bc753e) to head (93d8169).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1810   +/-   ##
=======================================
  Coverage   99.81%   99.81%           
=======================================
  Files         142      142           
  Lines        4873     4873           
  Branches      472      472           
=======================================
  Hits         4864     4864           
  Misses          6        6           
  Partials        3        3           

☔ View full report in Codecov by Sentry.

@andrewkern andrewkern requested a review from nspope March 3, 2026 20:43
@andrewkern
Member Author

tagging you @nspope -- if you could help to review this that would be amazing

@andrewkern
Member Author

@petrelharp -- heads up, it looks like the pyslim docs site is returning a 404

Collaborator

@nspope nspope left a comment


I get the gist here (and it's a clever way to automate uploading without sharing credentials), but have a couple questions about the big picture.


# --- Validate against stdpopsim catalog ---
echo "Validating against stdpopsim catalog..."
VALIDATION=$(python3 << PYEOF
Collaborator


I'm probably being dense here, but why do we need to recreate this validation here, if the same code is being run as an action? Shouldn't it be one or the other?

Member Author


No, this is a fair point. My thinking was the local validation gives fast feedback before creating a draft release and triggering the workflow. The workflow validation is the security gate (since someone could bypass the script). But it is duplicated code. This is the "belt and suspenders" defensive approach I was talking to you about.

We could remove the local validation and let the workflow be the single source of truth, but this would be a slower feedback loop.
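One way to keep the check in both places without duplicating the code would be a single validation entry point called identically from the local script and the workflow. This is a sketch only; `validate_resource` is a hypothetical helper name, not something in the PR:

```shell
# Hypothetical shared helper; in practice this could be a python module
# invoked from both upload_to_s3.sh and the Actions workflow.
validate_resource() {
    [ "$#" -eq 3 ] || { echo "usage: validate_resource SPECIES TYPE ID" >&2; return 1; }
    echo "validated $1/$2/$3"
}
validate_resource HomSap genetic_map HapMapII_GRCh38
```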

Contributor


I vote for checking this twice!

Contributor


For instance, it's relatively easy for the workflow to be accidentally turned off; doing it locally is I think great.

# ./maintenance/upload_to_s3.sh --dry-run data/test.tar.gz HomSap genetic_map HapMapII_GRCh38
#
# The script looks up the resource in the stdpopsim catalog to find the
# expected S3 URL and SHA256, verifies the local file matches, then uploads
Collaborator

@nspope nspope Mar 4, 2026


"verifies the local file matches" makes it seem like the file has already been uploaded; but that can't be the case right?

I think it'd help clarify if you spelled out the workflow a bit more, something like

  1. Create a catalogue entry for the new resource, in a feature branch
  2. Use this script and point it at the local path to the resource; it will validate it against the feature branch and upload it to AWS
  3. Merge the feature branch

Maybe this should go in the developer docs as well?

Member Author


Yeah, this comment is ambiguous... agreed this should be changed and added to the dev docs.

echo "Or watch the latest run:"
echo " gh run watch --repo $REPO \$(gh run list --repo $REPO --workflow $WORKFLOW --limit 1 --json databaseId --jq '.[0].databaseId')"
echo ""
echo "Final S3 URL: $S3_URL"
Collaborator


Do you think it's worth uploading some metadata or otherwise trying to version what's on AWS? So that, if someone uses this script, we'll have a record of who/when/what? Or, I guess this would be tracked by the feature branch that adds the new resource?

Member Author


Good idea! I looked into this quickly and see that we can include metadata with a call like:

  aws s3 cp file.tar.gz s3://bucket/path.tar.gz \
      --metadata '{"uploaded-by":"andrewkern","pr":"1810","timestamp":"2026-03-05"}'

Contributor


Where does that metadata go? I was imagining @nspope was suggesting also uploading a JSON file.
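For context, key-value pairs passed via `--metadata` are stored as S3 object metadata and come back as `x-amz-meta-*` entries from `aws s3api head-object`; they are not a separately browsable file. If a standalone record is wanted instead, a JSON sidecar is one option; the file and key names below are illustrative, not from the PR:

```shell
# Hypothetical sidecar record, written locally before upload.
cat > upload_record.json << 'EOF'
{"uploaded-by": "andrewkern", "pr": "1810", "timestamp": "2026-03-05"}
EOF
# It could then be pushed with a plain copy next to the artifact, e.g.:
#   aws s3 cp upload_record.json s3://stdpopsim/<path>.tar.gz.meta.json
# Sanity-check that the record is valid JSON:
python3 -c "import json; print(json.load(open('upload_record.json'))['pr'])"
```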

expected_sha256 = resource.intervals_sha256

# Extract S3 path from URL (handles both s3-us-west-2 and s3.us-west-2 styles)
m = re.match(r"https://stdpopsim\.s3[.-]us-west-2\.amazonaws\.com/(.*)", s3_url)
Contributor


I'm unclear why this is necessary?


echo ""
if [ "$DRY_RUN" = true ]; then
echo "Dry-run workflow triggered. It will validate everything but skip the S3 upload."
Contributor


should the dry-run also skip the release?
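If so, a sketch of the guard; the variable and function names are stand-ins, not the script's actual ones:

```shell
# Skip the draft-release creation entirely when --dry-run is given.
DRY_RUN=true
create_draft_release() { echo "created draft release"; }
if [ "$DRY_RUN" = true ]; then
    echo "dry-run: skipping draft release creation"
else
    create_draft_release
fi
```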

--repo "$REPO" \
--draft \
--title "S3 upload: $SPECIES_ID/$RESOURCE_TYPE/$RESOURCE_ID" \
--notes "Temporary draft release for S3 upload. Will be cleaned up automatically." \
Contributor


Where does the advertised cleaning up happen?

Contributor


ah I see, here. But it won't get cleaned up if someone doesn't run the aws script. I'm a little unclear about what the "release" actually is, though.

exit 1
fi

# S3 destination must end with .tar.gz or .tgz
Contributor


How about we also validate it's of the form _vN.tar.gz (as required)?
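A sketch of that stricter check, requiring an integer `_vN` version before the extension:

```shell
# Reject destinations that lack a _vN version suffix before .tar.gz/.tgz.
check_dest() {
    printf '%s' "$1" | grep -Eq '_v[0-9]+\.(tar\.gz|tgz)$'
}
check_dest "HomSap/HapMapII_GRCh38_v1.tar.gz" && echo "ok"
check_dest "HomSap/HapMapII_GRCh38.tar.gz" || echo "rejected: missing _vN"
```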

run: |
DEST="${{ inputs.s3_destination }}"
if aws s3api head-object --bucket stdpopsim --key "$DEST" 2>/dev/null; then
echo "ERROR: s3://stdpopsim/$DEST already exists. Refusing to overwrite."
Contributor


Suggested change (add a line after the existing error message):

  echo "ERROR: s3://stdpopsim/$DEST already exists. Refusing to overwrite."
  echo "Do not replace or remove previously-released files!"

GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
run: |
TAG="${{ inputs.release_tag }}"
echo "Cleaning up draft release: $TAG"
Contributor


we should at least validate the form of $TAG here to make sure this isn't deleting something it shouldn't?
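A sketch of such a guard; the `s3-upload-` prefix is an assumed naming scheme for the temporary tags, not something the PR specifies:

```shell
# Only delete tags matching the expected temporary-release pattern,
# so a mistyped input can't remove a real release.
safe_to_delete() {
    printf '%s' "$1" | grep -Eq '^s3-upload-[A-Za-z0-9._-]+$'
}
safe_to_delete "s3-upload-HomSap-HapMapII_GRCh38" && echo "ok to delete"
safe_to_delete "v1.0.0" || echo "refusing: not a temporary upload tag"
```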

echo "SHA256: ${{ inputs.expected_sha256 }}"

- name: Dry run summary
if: ${{ inputs.dry_run == true }}
Contributor


If someone runs --dry-run, won't the draft release get deleted, so they won't be able to continue on to do the wet run?

@petrelharp
Contributor

Rebasing should make the pyslim error go away.
